MSstats: Protein/Peptide significance analysis

Package: MSstats

Author: Anshuman Raina

Date: 5th Semptember 2024

Introduction

MSstats, an R package in Bioconductor, supports protein significance analysis for statistical relative quantification of proteins and peptides in global, targeted and data-independent proteomics. It handles shotgun, label-free and label-based (universal synthetic peptide-based) SRM (selected reaction monitoring), and SWATH/DIA (data independent acquisition) experiments. It can be used for experiments with complex designs (e.g. comparing more than two experimental conditions, or a time course).

This vignette summarizes the introduction and various options of all functionalities in MSstats. More details are available in User Manual.

Installation

To install this package, start R (version “4.0”) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstats")
library(MSstats)

1. Workflow

1.1 Raw Data

To begin with, we will load sample datasets, including both annotated and plain data. The dataset you need can be found here.

We will also load the Annotation Dataset using MSstatsConvert. You can access this dataset here.

library(MSstats)

# Load data
pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv", 
                    package = "MSstatsConvert")

annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv", 
                   package = "MSstatsConvert")

pd = data.table::fread(pd_raw)
annotation = data.table::fread(annotation_raw)

head(pd, 5)
##    Confidence.Level Search.ID Processing.Node.No      Sequence Unique.Sequence.ID PSM.Ambiguity
##              <char>    <char>              <int>        <char>              <int>        <char>
## 1:             High         A                  4     SLIASTLYR               1327   Unambiguous
## 2:             High         A                  4   AYLATQGVEIR               2889   Unambiguous
## 3:             High         A                  4 NHEIIGDIVPLAK               4700   Unambiguous
## 4:             High         A                  4 NHEIIGDIVPLAK               4700   Unambiguous
## 5:             High         A                  4  YHVNQYTGDESR               5209   Unambiguous
##                                                                                                    Protein.Descriptions
##                                                                                                                  <char>
## 1:                                       Uridine kinase OS=Escherichia coli (strain K12) GN=udk PE=3 SV=1 - [URK_ECOLI]
## 2: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 3: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 4: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 5: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
##    X..Proteins X..Protein.Groups Protein.Group.Accessions Modifications Activation.Type DeltaScore DeltaCn
##          <int>             <int>                   <char>        <char>          <char>      <int>   <int>
## 1:           1                 1                   P0A8F4                           CID          1       0
## 2:           1                 1                   P60664                           CID          1       0
## 3:           1                 1                   P60664                           CID          1       0
## 4:           1                 1                   P60664                           CID          1       0
## 5:           1                 1                   P60664                           CID          1       0
##     Rank Search.Engine.Rank Precursor.Area QuanResultID Decoy.Peptides.Matched Exp.Value Homology.Threshold
##    <int>              <int>          <num>       <lgcl>                  <int>     <num>              <int>
## 1:     1                  1       3.26e+07           NA                     NA   2.7e-01                 13
## 2:     1                  1       2.71e+08           NA                     NA   8.4e-05                 13
## 3:     1                  1       1.40e+08           NA                     NA   6.6e-03                 13
## 4:     1                  1       2.13e+08           NA                     NA   4.5e-04                 13
## 5:     1                  1       5.43e+06           NA                     NA   3.8e-02                 13
##    Identity.High Identity.Middle IonScore Peptides.Matched X..Missed.Cleavages Isolation.Interference....
##            <int>           <int>    <int>            <int>               <int>                      <int>
## 1:            13              13       19                6                   0                         53
## 2:            13              13       54                9                   0                         25
## 3:            13              13       35               10                   0                         64
## 4:            13              13       46               10                   0                         50
## 5:            13              13       27                3                   0                         29
##    Ion.Inject.Time..ms. Intensity Charge m.z..Da. MH...Da. Delta.Mass..Da. Delta.Mass..PPM. RT..min.
##                   <int>     <num>  <int>    <num>    <num>           <int>            <num>    <num>
## 1:                    3   1590000      2 512.2952 1023.583               0            -0.17    48.61
## 2:                    0  17200000      2 610.8357 1220.664               0             0.67    45.31
## 3:                    1   3100000      3 473.6051 1418.801               0             0.35    58.58
## 4:                    3   2020000      2 709.9044 1418.802               0             0.92    58.53
## 5:                   12    579000      2 734.8257 1468.644               0            -0.74    23.52
##    First.Scan Last.Scan MS.Order Ions.Matched Matched.Ions Total.Ions
##         <int>     <int>   <char>       <char>        <int>      <int>
## 1:      14971     14971      MS2       Jun-74            6         74
## 2:      13599     13599      MS2       Sep-98            9         98
## 3:      19004     19004      MS2        8/128            8        128
## 4:      18981     18981      MS2       14/128           14        128
## 5:       4707      4707      MS2        8/112            8        112
##                                      Spectrum.File Annotation
##                                             <char>     <lgcl>
## 1: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw         NA
## 2: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw         NA
## 3: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw         NA
## 4: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw         NA
## 5: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw         NA
head(annotation, 5)
##                                                Run  Condition BioReplicate
##                                             <char>     <char>        <int>
## 1: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw Condition1            1
## 2: 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2.raw Condition1            1
## 3: 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3.raw Condition1            1
## 4: 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1.raw Condition2            2
## 5: 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2.raw Condition2            2

1.2 Converters

We have the following converters, which allow you to convert various types of output reports which include the feature level data to the required input format of MSstats.

  1. DIANNtoMSstatsFormat
  2. DIAUmpiretoMSstatsFormat
  3. FragPipetoMSstatsFormat
  4. MaxQtoMSstatsFormat
  5. OpenMStoMSstatsFormat
  6. OpenSWATHtoMSstatsFormat
  7. PDtoMSstatsFormat
  8. ProgenesistoMSstatsFormat
  9. SkylinetoMSstatsFormat
  10. SpectronauttoMSstatsFormat

We show an example of how to use the above said Formatters.

input_data = read.csv("../data/diann_report.tsv", 
                      sep="\t")
head(input_data)
##                                            File.Name                                              Run
## 1   230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA.d   230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA
## 2   230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA.d   230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA
## 3   230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA.d   230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA
## 4 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA.d 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA
## 5  230719_THP-1_Chrom_end2end_Plate1_DbET6_B08_DIA.d  230719_THP-1_Chrom_end2end_Plate1_DbET6_B08_DIA
## 6  230719_THP-1_Chrom_end2end_Plate1_DbET6_G10_DIA.d  230719_THP-1_Chrom_end2end_Plate1_DbET6_G10_DIA
##   Protein.Group Protein.Ids Protein.Names    Genes PG.Quantity PG.Normalised PG.MaxLFQ Genes.Quantity
## 1        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     79142.6       46870.4   49600.6        79142.6
## 2        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     97279.9       58082.6   55838.1        97279.9
## 3        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     75382.3       39671.6   46009.7        75382.3
## 4        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     99529.9       49235.4   52154.0        99529.9
## 5        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     88499.7       52884.9   56081.4        88499.7
## 6        Q9NVA2      Q9NVA2   SEP11_HUMAN SEPTIN11     74063.6       64431.5   61170.8        74063.6
##   Genes.Normalised Genes.MaxLFQ Genes.MaxLFQ.Unique             Modified.Sequence   Stripped.Sequence
## 1          46870.4      49600.6             49600.6 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 2          58082.6      55838.1             55838.1 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 3          39671.6      46009.7             46009.7 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 4          49235.4      52154.0             52154.0 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 5          52884.9      56081.4             56081.4           AAAQLLQSQAQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 6          64431.5      61170.8             61170.8           AAAQLLQSQAQQSGAQQTK AAAQLLQSQAQQSGAQQTK
##                     Precursor.Id Precursor.Charge     Q.Value         PEP Global.Q.Value Protein.Q.Value
## 1 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2                2 6.59541e-03 1.87628e-01    1.09371e-02     0.000293945
## 2 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2                2 5.09463e-03 1.58092e-01    1.09371e-02     0.000301841
## 3 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2                2 3.51206e-04 4.28005e-03    1.09371e-02     0.000279096
## 4 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2                2 7.36836e-04 1.27557e-02    1.09371e-02     0.000293341
## 5           AAAQLLQSQAQQSGAQQTK2                2 6.73729e-03 1.99875e-01    4.46056e-03     0.000301750
## 6           AAAQLLQSQAQQSGAQQTK3                3 1.32103e-05 1.32103e-05    2.78130e-09     0.000298775
##    PG.Q.Value Global.PG.Q.Value  GG.Q.Value Translated.Q.Value Proteotypic Precursor.Quantity
## 1 0.000293169       0.000270051 0.000293945                  0           1            132.008
## 2 0.000300842       0.000270051 0.000301841                  0           1            110.006
## 3 0.000278319       0.000270051 0.000279096                  0           1            765.040
## 4 0.000292312       0.000270051 0.000293341                  0           1            331.017
## 5 0.000301023       0.000270051 0.000301750                  0           1            328.017
## 6 0.000297619       0.000270051 0.000298775                  0           1          11670.300
##   Precursor.Normalised Precursor.Translated Translated.Quality Ms1.Translated Quantity.Quality      RT
## 1              69.2365              115.112                 NA        27216.6         0.816497 5.49752
## 2              65.1416              100.941                 NA        18976.7         0.656910 5.57173
## 3             360.6170              653.304                 NA        41557.7         0.691451 5.51056
## 4             161.4600              292.662                 NA        42097.8         0.767069 5.47232
## 5             165.5430              274.266                 NA        36324.9         0.350367 5.54726
## 6           13575.6000            15046.300                 NA        23575.0         0.953474 5.44155
##   RT.Start RT.Stop     iRT Predicted.RT Predicted.iRT First.Protein.Description Lib.Q.Value Lib.PG.Q.Value
## 1  5.47311 5.52203 24.1531      5.51437       24.0711                 Septin-11 0.000441969    0.000307314
## 2  5.54731 5.59616 24.1531      5.59037       24.0589                 Septin-11 0.000441969    0.000307314
## 3  5.46164 5.53499 24.1531      5.52646       24.0707                 Septin-11 0.000441969    0.000307314
## 4  5.43564 5.49676 24.1531      5.51338       23.9568                 Septin-11 0.000441969    0.000307314
## 5  5.52279 5.57168 24.0491      5.53787       24.0952                 Septin-11 0.000148948    0.000307314
## 6  5.39265 5.51484 24.0238      5.44345       24.0249                 Septin-11 0.000000001    0.000307314
##   Ms1.Profile.Corr Ms1.Area Evidence Spectrum.Similarity Averagine Mass.Evidence   CScore Decoy.Evidence
## 1         0.551782  31211.4  1.61919            0.222315 1.0000000      0.000000 0.851078       0.000000
## 2         0.785901  20680.9  1.90409            0.117410 0.0987181      0.000000 0.873837       0.000000
## 3         0.766113  48665.3  2.08056            0.395749 0.0591408      0.000000 0.995963       0.000000
## 4         0.846188  47614.9  2.97196            0.533900 1.0000000      0.000000 0.988132       0.000000
## 5         0.665084  43443.8  1.53823            0.198184 0.8581840      0.000000 0.843245       0.000000
## 6         0.976121  18285.4  4.53526            0.438272 1.0000000      0.588394 0.999988       0.830825
##   Decoy.CScore                                                                       Fragment.Quant.Raw
## 1 -1.00000e+07                                               132.008;0;0;69.0042;0;0;0;0;0;131.008;0;0;
## 2 -1.00000e+07                                               67.0042;0;43.0021;153.008;0;0;0;0;0;0;0;0;
## 3 -1.00000e+07                             386.021;131.006;248.013;206.008;0;88.0042;0;61.0021;0;0;0;0;
## 4 -1.00000e+07                                    0;167.008;164.008;258.011;229.01;0;0;166.006;0;0;0;0;
## 5 -1.00000e+07                                   144.008;171.008;0;184.008;83.0042;0;56.0021;0;0;0;0;0;
## 6  1.03618e-01 5303.15;2541.07;3826.1;2771.07;783.019;3142.07;1047.03;863.022;1214.03;850.02;401.011;0;
##                                                                   Fragment.Quant.Corrected
## 1                                               132.008;0;0;69.0042;0;0;0;0;0;131.008;0;0;
## 2                                               67.0042;0;43.0021;153.008;0;0;0;0;0;0;0;0;
## 3                             386.021;131.006;248.013;206.008;0;88.0042;0;61.0021;0;0;0;0;
## 4                                    0;167.008;164.008;258.011;229.01;0;0;166.006;0;0;0;0;
## 5                                   144.008;171.008;0;184.008;83.0042;0;56.0021;0;0;0;0;0;
## 6 5303.15;2541.07;3826.1;2771.07;783.019;3142.07;1047.03;863.022;1214.03;850.02;401.011;0;
##                                                                                 Fragment.Correlations
## 1                                                       0.816497;0;0;0.816497;0;0;0;0;0;0.408248;0;0;
## 2                                                       0.816497;0;0.408248;0.476696;0;0;0;0;0;0;0;0;
## 3                                  0.774108;0.674748;0.571623;0.500944;0;0.721572;0;0.500944;0;0;0;0;
## 4                                         0;0.595407;0.941872;0.832934;0.715275;0;0;0.732418;0;0;0;0;
## 5                                         0.276408;0.816497;0;0.408248;0.816497;0;0.408248;0;0;0;0;0;
## 6 0.968525;0.99425;0.905533;0.900774;0.861883;0.819064;0.804806;0.929188;0.75831;0.662669;0.872515;0;
##   MS2.Scan       IM      iIM Predicted.IM Predicted.iIM
## 1    10775 1.183750 1.185000     1.179540      1.188150
## 2    10919 1.185000 1.185000     1.177780      1.190830
## 3    10799 1.185000 1.185000     1.179370      1.189810
## 4    10727 1.193520 1.185000     1.179870      1.197940
## 5    10871 1.192500 1.183860     1.179140      1.196260
## 6    10663 0.900937 0.897083     0.884704      0.912769
annotation_file = read.csv("../data/diann_annotation.csv")
head(annotation_file)
##                                                Run     Tx Well Unique.Peptides  Plate Condition BioReplicate
## 1   230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA   DMSO  A02           45341 Plate1      DMSO            2
## 2   230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA   DMSO  A05           44666 Plate1      DMSO            5
## 3 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA Jakafi  A08           47029 Plate1    Jakafi            8
## 4   230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA   DMSO  A10           46326 Plate1      DMSO           10
## 5   230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA   DMSO  A12           47363 Plate1      DMSO           12
## 6   230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA   DMSO  B01           46873 Plate1      DMSO           13
msstats_format = MSstatsConvert::DIANNtoMSstatsFormat(input_data, 
                                      annotation=annotation_file,
                                      global_qvalue_cutoff = 0.01,
                                      qvalue_cutoff = 0.01,
                                      pg_qvalue_cutoff = 0.01,
                                      useUniquePeptide = TRUE,
                                      removeFewMeasurements = TRUE,
                                      removeOxidationMpeptides = TRUE,
                                      removeProtein_with1Feature = TRUE,
                                      MBR=TRUE)
head(msstats_format)
##   ProteinName               PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
## 1 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag1             1            Light
## 2 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag3             1            Light
## 3 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag4             1            Light
## 4 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag1             1            Light
## 5 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag3             1            Light
## 6 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK               2       Frag4             1            Light
##   Condition BioReplicate                                            Run Fraction Intensity
## 1      DMSO           10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA        1  132.0080
## 2      DMSO           10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA        1    0.0000
## 3      DMSO           10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA        1   69.0042
## 4      DMSO            5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA        1   67.0042
## 5      DMSO            5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA        1   43.0021
## 6      DMSO            5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA        1  153.0080

1.3 Loading PD Data to MSstats

The imported data from Step 1.1. now must be converted through MSStatsConvert package’s PDtoMSstatsFormat converter.

This function converts the Proteome discoverer output into the required input format for MSstats.

MSstatsConvert::PDtoMSstatsFormat(input,
                              annotation,
                              useNumProteinsColumn=FALSE,
                              useUniquePeptide=TRUE,
                              summaryforMultipleRows=max,
                              fewMeasurements="remove",
                              removeOxidationMpeptides=FALSE,
                              removeProtein_with1Peptide=FALSE,
                              which.quantification = 'Precursor.Area',
                              which.proteinid = 'Protein.Group.Accessions',
                              which.sequence = 'Sequence')

Actual Data modification can be seen below:

pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, use_log_file = FALSE)
## INFO  [2024-09-08 20:05:03] ** Raw data from ProteomeDiscoverer imported successfully.
## INFO  [2024-09-08 20:05:03] ** Raw data from ProteomeDiscoverer cleaned successfully.
## INFO  [2024-09-08 20:05:03] ** Using provided annotation.
## INFO  [2024-09-08 20:05:03] ** Run labels were standardized to remove symbols such as '.' or '%'.
## INFO  [2024-09-08 20:05:03] ** The following options are used:
##   - Features will be defined by the columns: PeptideSequence, PrecursorCharge
##   - Shared peptides will be removed.
##   - Proteins with single feature will not be removed.
##   - Features with less than 3 measurements across runs will be removed.
## INFO  [2024-09-08 20:05:03] ** Features with all missing measurements across runs are removed.
## INFO  [2024-09-08 20:05:03] ** Shared peptides are removed.
## INFO  [2024-09-08 20:05:03] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
## INFO  [2024-09-08 20:05:03] ** Features with one or two measurements across runs are removed.
## INFO  [2024-09-08 20:05:03] ** Run annotation merged with quantification data.
## INFO  [2024-09-08 20:05:03] ** Features with one or two measurements across runs are removed.
## INFO  [2024-09-08 20:05:03] ** Fractionation handled.
## INFO  [2024-09-08 20:05:03] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO  [2024-09-08 20:05:03] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
head(pd_imported)
##   ProteinName PeptideModifiedSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType  Condition
## 1      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition1
## 2      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition1
## 3      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition1
## 4      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition2
## 5      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition2
## 6      P0ABU9         ANSHAPEAVVEGASR               2          NA            NA                L Condition2
##   BioReplicate                                            Run Fraction Intensity
## 1            1 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw        1  21400000
## 2            1 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw        1  17500000
## 3            1 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw        1        NA
## 4            2 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw        1  11600000
## 5            2 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2raw        1  12000000
## 6            2 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw        1  16200000

1.4 Data Process

Once we import the dataset correctly with Formatter, we need to pre-process the data which is taken care by dataProcess.

This step is essentially Data pre-processing and quality control of MS runs of the original raw data into quantitative data for model fitting and group comparison.

There are 3 main steps that happen in background :

  • Normalization - There are three different normalizations supported. ‘equalizeMedians’ (default) represents constant normalization (equalizing the medians) based on reference signals is performed. ‘quantile’ represents quantile normalization based on reference signals is performed. ‘globalStandards’ represents normalization with global standards proteins. FALSE represents no normalization is performed.

  • Feature selection - This also has three options i.e. Select All features, Top-N features (by mean intensity) or “Best” features.

  • Missing value imputation - We impute plausible values in case of missing data points. The RunLevelData can be queried to show Number of imputed intensities (censored intensities) in a RUN and Protein.

summarized = dataProcess(pd_imported)
## INFO  [2024-09-08 20:05:03] ** Log2 intensities under cutoff = 23.053  were considered as censored missing values.
## INFO  [2024-09-08 20:05:03] ** Log2 intensities = NA were considered as censored missing values.
## INFO  [2024-09-08 20:05:03] ** Use all features that the dataset originally has.
## INFO  [2024-09-08 20:05:03] 
##  # proteins: 5
##  # peptides per protein: 1-16
##  # features per peptide: 1-1
## INFO  [2024-09-08 20:05:03] Some proteins have only one feature: 
##  P00363,
##  P0A8J2 ...
## INFO  [2024-09-08 20:05:03] 
##                     Condition1 Condition2 Condition3 Condition4 Condition5
##              # runs          3          3          3          3          3
##     # bioreplicates          1          1          1          1          1
##  # tech. replicates          3          3          3          3          3
## INFO  [2024-09-08 20:05:03] Some features are completely missing in at least one condition:  
##  LDEGcTERC5(Carbamidomethyl)_2_NA_NA,
##  ELREQVGDEHIGVIPEDcYYKC18(Carbamidomethyl)_3_NA_NA,
##  TNYDHPSAMDHSLLLEHLQALK_3_NA_NA,
##  LARPGSDVALDDQLYQEPQAAPVAVPMGK_3_NA_NA,
##  AYLATQGVEIR_2_NA_NA ...
## INFO  [2024-09-08 20:05:03]  == Start the summarization per subplot...
##   |                                                                                                            |                                                                                                    |   0%  |                                                                                                            |====================                                                                                |  20%  |                                                                                                            |========================================                                                            |  40%  |                                                                                                            |============================================================                                        |  60%  |                                                                                                            |================================================================================                    |  80%  |                                                                                                            |====================================================================================================| 100%
## INFO  [2024-09-08 20:05:04]  == Summarization is done.
head(summarized$FeatureLevelData)
##   PROTEIN           PEPTIDE TRANSITION                 FEATURE LABEL      GROUP RUN SUBJECT FRACTION
## 1  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition1   1       1        1
## 2  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition1   2       1        1
## 3  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition1   3       1        1
## 4  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition2   4       2        1
## 5  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition2   5       2        1
## 6  P0ABU9 ANSHAPEAVVEGASR_2      NA_NA ANSHAPEAVVEGASR_2_NA_NA     L Condition2   6       2        1
##                                      originalRUN censored INTENSITY ABUNDANCE newABUNDANCE predicted
## 1 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw    FALSE  21400000  23.71945     23.71945        NA
## 2 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw    FALSE  17500000  24.06085     24.06085        NA
## 3 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw     TRUE        NA        NA     22.77604  22.77604
## 4 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw    FALSE  11600000  23.77304     23.77304        NA
## 5 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2raw     TRUE  12000000  23.00805     22.95207  22.95207
## 6 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw    FALSE  16200000  23.74312     23.74312        NA
head(summarized$ProteinLevelData)
##   RUN Protein LogIntensities                                    originalRUN      GROUP SUBJECT
## 1   1  P0A8F4       22.96185 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw Condition1       1
## 2   2  P0A8F4       23.27048 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw Condition1       1
## 3   3  P0A8F4       23.44357 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw Condition1       1
## 4   4  P0A8F4       23.31217 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw Condition2       2
## 5   6  P0A8F4       23.87516 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw Condition2       2
## 6   8  P0A8F4       24.31958 121219_S_CCES_01_08_LysC_Try_1to10_Mixt_3_2raw Condition3       3
##   TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing NumImputedFeature
## 1                      9                  1         0.6666667          TRUE                 2
## 2                      9                  1         0.6666667          TRUE                 2
## 3                      9                  1         0.6666667          TRUE                 2
## 4                      9                  1         0.6666667          TRUE                 2
## 5                      9                  1         0.6666667          TRUE                 2
## 6                      9                  2         0.3333333         FALSE                 1
head(summarized$SummaryMethod)
## [1] "TMP"

1.4.1 Data Process Plots

After dataProcessing the input data, MSstats provides multiple plots to analyze the pre-processed data. Here we show the various types of plots we can use. Once invoked, a pdf file will be downloaded with corresponding feature level data and the Plot generated.

# Profile plot
dataProcessPlots(data=summarized, type="ProfilePlot")

# Quality control plot
dataProcessPlots(data=summarized, type="QCPlot")

# Quantification plot for conditions
dataProcessPlots(data=summarized, type="ConditionPlot")

1.5 Modeling

In this step we test for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment.

It is applicable to multiple types of sample preparation, including label-free workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use fractionation. Experimental design of case-control study (patients are not repeatedly measured) or time course study (patients are repeatedly measured) is automatically determined based on proper statistical model.

# Model
labels = unique(summarized$ProteinLevelData$GROUP)
contrast_matrix = MSstatsContrastMatrix('pairwise', labels)

model = groupComparison(contrast_matrix, summarized)
## INFO  [2024-09-08 20:05:09]  == Start to test and get inference in whole plot ...
##   |                                                                                                            |                                                                                                    |   0%  |                                                                                                            |=========================                                                                           |  25%  |                                                                                                            |==================================================                                                  |  50%  |                                                                                                            |===========================================================================                         |  75%  |                                                                                                            |====================================================================================================| 100%
## INFO  [2024-09-08 20:05:09]  == Comparisons for all proteins are done.

Model Details

head(model$ModelQC)
##   RUN Protein ABUNDANCE                                    originalRUN      GROUP SUBJECT
## 1   1  P0A8F4  22.96185 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw Condition1       1
## 2   2  P0A8F4  23.27048 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw Condition1       1
## 3   3  P0A8F4  23.44357 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw Condition1       1
## 4   4  P0A8F4  23.31217 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw Condition2       2
## 5   6  P0A8F4  23.87516 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw Condition2       2
## 6   8  P0A8F4  24.31958 121219_S_CCES_01_08_LysC_Try_1to10_Mixt_3_2raw Condition3       3
##   TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing NumImputedFeature   residuals
## 1                      9                  1         0.6666667          TRUE                 2 -0.26344817
## 2                      9                  1         0.6666667          TRUE                 2  0.04517815
## 3                      9                  1         0.6666667          TRUE                 2  0.21827003
## 4                      9                  1         0.6666667          TRUE                 2 -0.28149357
## 5                      9                  1         0.6666667          TRUE                 2  0.28149357
## 6                      9                  2         0.3333333         FALSE                 1  0.66038185
##     fitted
## 1 23.22530
## 2 23.22530
## 3 23.22530
## 4 23.59366
## 5 23.59366
## 6 23.65919

1.5.1 groupComparisonPlot

Visualization for model-based analysis and summarizing differentially abundant proteins. To summarize the results of log-fold changes and adjusted p-values for differentially abundant proteins, groupComparisonPlots takes testing results from function groupComparison as input and automatically generate three types of figures in pdf files as output :

  • Volcano plot : For each comparison separately. It illustrates actual log-fold changes and adjusted p-values for each comparison separately with all proteins. The x-axis is the log fold change. The base of logarithm transformation is the same as specified in “logTrans” from dataProcess. The y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed line represents the FDR cutoff. The points below the FDR cutoff line are non-significantly abundant proteins (colored in black). The points above the FDR cutoff line are significantly abundant proteins (colored in red/blue for up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific value), the points above the FDR cutoff line but within the FC cutoff line are non-significantly abundant proteins (colored in black).

  • Heatmap : For multiple comparisons. It illustratea up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance.

  • Comparison plot : For multiple comparisons per protein. It illustrates log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model.

groupComparisonPlots(
  model$ComparisonResult,
  type="Heatmap",
  sig = 0.05,
  FCcutoff = FALSE,
  logBase.pvalue = 10,
  ylimUp = FALSE,
  ylimDown = FALSE,
  xlimUp = FALSE,
  x.axis.size = 10,
  y.axis.size = 10,
  dot.size = 3,
  text.size = 4,
  text.angle = 0,
  legend.size = 13,
  ProteinName = TRUE,
  colorkey = TRUE,
  numProtein = 100,
  clustering = "both",
  width = 800,
  height = 600,
  which.Comparison = "all",
  which.Protein = "all",
  address = FALSE,
  isPlotly = TRUE
)

1.6 Sample Size Calculation

Calculate sample size for future experiments of a Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment based on intensity-based linear model. The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. Two options of the calculation:

  • number of biological replicates per condition
  • power
sample_size_calc = designSampleSize(model$FittedModel,
                                    desiredFC=c(1.75,2.5),
                                    power = TRUE,
                                    numSample=5)

1.6.1 Sample Size Calculation Plot

To illustrate the relationship of desired fold change and the calculated minimal number sample size which are

Number of biological replicates per condition power. The input is the result from function designSampleSize.

designSampleSizePlots(sample_size_calc, isPlotly=FALSE)

1.7 Quantification from groupComparison Data

Model-based quantification for each condition or for each biological samples per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment. Quantification takes the processed data set by dataProcess as input and automatically generate the quantification results (data.frame) with long or matrix format. The quantification for endogenous samples is based on run summarization from subplot model, with TMP robust estimation.

  • Sample quantification : individual biological sample quantification for each protein. The label of each biological sample is a combination of the corresponding group and the sample ID. If there are no technical replicates or experimental replicates per sample, sample quantification is the same as run summarization from dataProcess (xxx$RunlevelData from dataProcess). If there are technical replicates or experimental replicates, sample quantification is median among run quantification corresponding MS runs.

  • Group quantification : quantification for individual group or individual condition per protein. It is median among sample quantification.

sample_quant_long = quantification(summarized,
                             type = "Sample",
                             format = "long")
sample_quant_long
##     Protein Group_Subject LogIntensity
##      <fctr>        <fctr>        <num>
##  1:  P0A8F4  Condition1_1     23.27048
##  2:  P0A8J2  Condition1_1     25.41377
##  3:  P0ABU9  Condition1_1     23.94076
##  4:  P60664  Condition1_1     26.95914
##  5:  P0A8F4  Condition2_2     23.59366
##  6:  P0A8J2  Condition2_2     25.37768
##  7:  P0ABU9  Condition2_2     24.35179
##  8:  P60664  Condition2_2     27.36184
##  9:  P0A8F4  Condition3_3     23.65919
## 10:  P0A8J2  Condition3_3     24.84218
## 11:  P0ABU9  Condition3_3     23.97927
## 12:  P60664  Condition3_3     26.89201
## 13:  P0A8F4  Condition4_4     24.04638
## 14:  P0A8J2  Condition4_4           NA
## 15:  P0ABU9  Condition4_4     24.96019
## 16:  P60664  Condition4_4     27.69317
## 17:  P0A8F4  Condition5_5     24.50374
## 18:  P0A8J2  Condition5_5           NA
## 19:  P0ABU9  Condition5_5     25.42248
## 20:  P60664  Condition5_5     27.98325
##     Protein Group_Subject LogIntensity
sample_quant_wide = quantification(summarized,
                              type = "Sample",
                              format = "matrix")
sample_quant_wide
## Key: <Protein>
##    Protein Condition1_1 Condition2_2 Condition3_3 Condition4_4 Condition5_5
##     <fctr>        <num>        <num>        <num>        <num>        <num>
## 1:  P0A8F4     23.27048     23.59366     23.65919     24.04638     24.50374
## 2:  P0A8J2     25.41377     25.37768     24.84218           NA           NA
## 3:  P0ABU9     23.94076     24.35179     23.97927     24.96019     25.42248
## 4:  P60664     26.95914     27.36184     26.89201     27.69317     27.98325
group_quant_long = quantification(summarized,
                                  type = "Group",
                                  format = "long")
group_quant_long
##     Protein      Group LogIntensity
##      <fctr>     <fctr>        <num>
##  1:  P0A8F4 Condition1     23.27048
##  2:  P0A8J2 Condition1     25.41377
##  3:  P0ABU9 Condition1     23.94076
##  4:  P60664 Condition1     26.95914
##  5:  P0A8F4 Condition2     23.59366
##  6:  P0A8J2 Condition2     25.37768
##  7:  P0ABU9 Condition2     24.35179
##  8:  P60664 Condition2     27.36184
##  9:  P0A8F4 Condition3     23.65919
## 10:  P0A8J2 Condition3     24.84218
## 11:  P0ABU9 Condition3     23.97927
## 12:  P60664 Condition3     26.89201
## 13:  P0A8F4 Condition4     24.04638
## 14:  P0A8J2 Condition4           NA
## 15:  P0ABU9 Condition4     24.96019
## 16:  P60664 Condition4     27.69317
## 17:  P0A8F4 Condition5     24.50374
## 18:  P0A8J2 Condition5           NA
## 19:  P0ABU9 Condition5     25.42248
## 20:  P60664 Condition5     27.98325
##     Protein      Group LogIntensity
group_quant_wide = quantification(summarized,
                                  type = "Group",
                                  format = "matrix")
group_quant_wide
## Key: <Protein>
##    Protein Condition1 Condition2 Condition3 Condition4 Condition5
##     <fctr>      <num>      <num>      <num>      <num>      <num>
## 1:  P0A8F4   23.27048   23.59366   23.65919   24.04638   24.50374
## 2:  P0A8J2   25.41377   25.37768   24.84218         NA         NA
## 3:  P0ABU9   23.94076   24.35179   23.97927   24.96019   25.42248
## 4:  P60664   26.95914   27.36184   26.89201   27.69317   27.98325